# load libraries
library(knitr)
library(tidyverse)7 Data Transformation: Organizing rows & Columns
Learning Objectives
After completing this lab you should be able to
- use
filter()andselect()to include/exclude rows and columns from a data frame. - use
distinct()to select only unique/distinct rows from a data frame. - use
separate()andunite()to separate content from a column across multiple columns and vice versa.
For each of our modules we will have a project-folder with an Rproject, *.qmd-files, and sub-directories for data, scripts, and results as described in our Rproject Tutorial. You should have a directory on your Desktop or Documents folder on your laptop (name it something like bi349) as a home directory for all of our project folders this semester.
You should have already downloaded the project directory for this module, make sure the directory is unzipped and move it to your bi328 directory. You can open the Rproj for this module either by double clicking on it which will launch Rstudio or by opening Rstudio and then using File > Open Project or by clicking on the Rproject icon in the top right of your program window and selecting Open Project.
Once you have opened a project you should see the project name in the top right corner1.
1 Pro tip: If you run into issues where a quarto document won’t render or file paths aren’t working (especially if things were working previously) one of your first steps should be to double check that the correct Rproj is loaded.
There should be a document named 07_data-transformation-i.qmd in your project directory. Use that file to work through this tutorial - you will hand in your rendered (“knitted”) quarto file as your homework assignment. So, first thing in the YAML header, change the author to your name. You will use this quarto document to record your answers. Remember to use comments to annotate your code; at minimum you should have one comment per code set2 you may of course add as many comments as you need to be able to recall what you did]. Similarly, take notes in the document as we discuss discussion/reflection questions but make sure that you go back and clean them up for “public consumption”.
2 You should do this whether you are adding code yourself or using code from our manual, even if it isn’t commented in the manual… especially when the code is already included for you, add comments to describe how the function works/what it does as we introduce it during the participatory coding session so you can refer back to it.
Let’s start by loading the packages we will need for this activity.
7.1 Data Wrangling
Now that we have a data frame to play with, let’s explore some data wrangling options using dplyr. The d stands for data and the plyr stands for plyers - this package is designed to manipulate data frames3. We are going to focus on the central actions (verbs) that will allow you to manipulate the data frame.
3 This is also known as data wrangling or data munging, manipulating is not a negative thing in this case it means you can clean up and format the data in appropriate way to fit the questions you are asking and allow to to transform the information in a helpful way so that you can apply analysis and modeling as needed in the next step
The main advantages to using a command line program like R/code compared to a spreadsheet program such as Excel or Google sheets are:
- You aren’t manipulating the raw data set - if you make a mistake or accidentally overwrite something you haven’t made any permanent damage.
- You can manipulate data sets too large to easily handle in a spreadsheet
- If you update your data set or have a second identically formatted data set you just have to re-run the code.
Be sure to record all the steps (code chunks) in your quarto document - both the examples given here and the applications you will be asked to make. You can copy and paste, but you will find that writing out the code will help you get more used to syntax, how auto complete etc. works. Be sure to annotate/comment your code as reminders while we go through new functions in class, and that you take the time to go over your comments before submitting your knotted *.html document.
These are central concepts that you will use and reuse throughout the semester so you will likely want to refer back to this document. A good way to create a “cheatsheet” would be to for example for each function write a short description of what it does in general before each code chunk, then make your comment in the code specific to your example. Similarly use normal text to refer to the question numbers in this manual as you work through the problem sets.
7.2 Selecting and organizing columns
Let’s start by loading our data set.
# read catch data
catch <- read_delim("data/longline_catchdata.txt", delim = "\t")We will make heavy use of the magrittr pipe %>% this smester which allows you to link commands together; think of it as “and now do this”. R for Data Science (2e) implements the native R pipe |>. For our intents and purposes they are identical.
Because we are mostly interested in what the individual functions (verbs) do we will not always assign a new object, but just having it print to the console/below the code chunk we will be able to immediately assess the affect. By piping our function to head() it will print just the first 6 lines.
The function select() is used to select a subset of columns from a data set.
For example, you can select just the Site and Species columns4.
4 Remember, the function head() allows you to just print the first few lines of the dataframe to the console, otherwise you can end up with several thousand lines!
catch %>%
select(Site, Species) %>%
head()# A tibble: 6 × 2
Site Species
<chr> <chr>
1 Aransas_Bay Bagre_marinus
2 Aransas_Bay Bagre_marinus
3 Aransas_Bay Bagre_marinus
4 Aransas_Bay Bagre_marinus
5 Aransas_Bay Bagre_marinus
6 Aransas_Bay Bagre_marinus
You can also specify individual columns to eliminate by name. For example, the PCL column doesn’t contain any information (all NAs).
catch %>%
select(-PCL) %>%
head()# A tibble: 6 × 11
Site Species Sex Observed_Stage FL STL Hook_Size Set Day Month
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aransas_… Bagre_… U <NA> 287 353 10 1 28 7
2 Aransas_… Bagre_… U <NA> 425 495 10 1 28 7
3 Aransas_… Bagre_… U <NA> 416 502 15 1 28 7
4 Aransas_… Bagre_… U <NA> 416 507 10 1 28 7
5 Aransas_… Bagre_… U <NA> 418 510 15 1 28 7
6 Aransas_… Bagre_… U <NA> 434 515 10 1 28 7
# ℹ 1 more variable: Year <dbl>
You can also eliminate multiple columns by name, for example you would remove Day, Month and Year like this:
catch %>%
select(-Day, -Month, -Year) %>%
head()# A tibble: 6 × 9
Site Species Sex Observed_Stage PCL FL STL Hook_Size Set
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aransas_Bay Bagre_mari… U <NA> NA 287 353 10 1
2 Aransas_Bay Bagre_mari… U <NA> NA 425 495 10 1
3 Aransas_Bay Bagre_mari… U <NA> NA 416 502 15 1
4 Aransas_Bay Bagre_mari… U <NA> NA 416 507 10 1
5 Aransas_Bay Bagre_mari… U <NA> NA 418 510 15 1
6 Aransas_Bay Bagre_mari… U <NA> NA 434 515 10 1
If you want to re-arrange columns in your data frame, you would also use select().
catch %>%
select(FL, Sex, Day) %>%
head()# A tibble: 6 × 3
FL Sex Day
<dbl> <chr> <dbl>
1 287 U 28
2 425 U 28
3 416 U 28
4 416 U 28
5 418 U 28
6 434 U 28
If you wanted to move a set of columns to the front, but not not want to have to type in all the other column names you can use everything().
catch %>%
select(Day, Month, Year, everything()) %>%
head()# A tibble: 6 × 12
Day Month Year Site Species Sex Observed_Stage PCL FL STL
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 28 7 2015 Aransas_Bay Bagre_ma… U <NA> NA 287 353
2 28 7 2015 Aransas_Bay Bagre_ma… U <NA> NA 425 495
3 28 7 2015 Aransas_Bay Bagre_ma… U <NA> NA 416 502
4 28 7 2015 Aransas_Bay Bagre_ma… U <NA> NA 416 507
5 28 7 2015 Aransas_Bay Bagre_ma… U <NA> NA 418 510
6 28 7 2015 Aransas_Bay Bagre_ma… U <NA> NA 434 515
# ℹ 2 more variables: Hook_Size <dbl>, Set <dbl>
There you go, creating subsets of columns: Simple as that.
7.3 Separating & uniting columns
Occasionally you will find that you want to combine the contents of two columns into a single column (e.g. first name, last name) or at other times you may want to separate the contents of a column over multiple columns (e.g. dates).
For example, you may have noticed that the Species is entered as genus_species - what if you wanted to have two separate columns with that information?
The function separate() will split the contents from one column across two or more columns. To do this you need to specify the new column names (into = c("column1", "column2")), and what pattern should be used to determine where the content should be split (sep = "pattern").
catch %>%
separate(Species, into = c("species", "genus"), sep = "_", remove = FALSE) %>%
head()# A tibble: 6 × 14
Site Species species genus Sex Observed_Stage PCL FL STL Hook_Size
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Aransa… Bagre_… Bagre mari… U <NA> NA 287 353 10
2 Aransa… Bagre_… Bagre mari… U <NA> NA 425 495 10
3 Aransa… Bagre_… Bagre mari… U <NA> NA 416 502 15
4 Aransa… Bagre_… Bagre mari… U <NA> NA 416 507 10
5 Aransa… Bagre_… Bagre mari… U <NA> NA 418 510 15
6 Aransa… Bagre_… Bagre mari… U <NA> NA 434 515 10
# ℹ 4 more variables: Set <dbl>, Day <dbl>, Month <dbl>, Year <dbl>
In other cases you might have information in two columns that you want to combine into a single column. This can be accomplished using the function unite().
For example, if we wanted to create a column called date that had the day, month, and year of each sampling trip separated by an _.
catch %>%
unite(Date, Day, Month, Year, sep = "_", remove = FALSE) %>%
head()# A tibble: 6 × 13
Site Species Sex Observed_Stage PCL FL STL Hook_Size Set Date
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Aransas_… Bagre_… U <NA> NA 287 353 10 1 28_7…
2 Aransas_… Bagre_… U <NA> NA 425 495 10 1 28_7…
3 Aransas_… Bagre_… U <NA> NA 416 502 15 1 28_7…
4 Aransas_… Bagre_… U <NA> NA 416 507 10 1 28_7…
5 Aransas_… Bagre_… U <NA> NA 418 510 15 1 28_7…
6 Aransas_… Bagre_… U <NA> NA 434 515 10 1 28_7…
# ℹ 3 more variables: Day <dbl>, Month <dbl>, Year <dbl>
7.4 Sorting dataframes by a specific column content
Until you want to visualize a table how the rows are arranged is not really important. However, for example, when generating reports you might want values to be listed in a specific way. This can be done using the function arrange().
For example, if we wanted to sort our dataframe based on the Observed_Stage column we could do the following:
catch %>%
arrange(Observed_Stage)# A tibble: 2,325 × 12
Site Species Sex Observed_Stage PCL FL STL Hook_Size Set Day
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Redfish… Sphyrn… M MAT 622 668 850 10 1 8
2 Redfish… Sphyrn… M MAT 656 710 869 10 1 8
3 Redfish… Sphyrn… F MAT 708 770 979 15 2 8
4 Corpus_… Sphyrn… M MAT 695 757 954 10 2 12
5 Corpus_… Sphyrn… F MAT 760 861 1090 10 2 12
6 Corpus_… Sphyrn… M MAT 621 689 856 10 2 27
7 Redfish… Sphyrn… F MAT 781 853 1020 10 4 29
8 Redfish… Sphyrn… M MAT 721 783 980 10 3 11
9 Redfish… Carcha… U UND NA NA NA 15 2 16
10 Corpus_… Sphyrn… U UND NA NA NA 10 1 27
# ℹ 2,315 more rows
# ℹ 2 more variables: Month <dbl>, Year <dbl>
By default, characters are sorted alphabetically, numeric columns from smallest to largest value. If you want to order your values from largest to smallest, you can specify that using desc()
catch %>%
arrange(desc(FL))# A tibble: 2,325 × 12
Site Species Sex Observed_Stage PCL FL STL Hook_Size Set Day
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aransas… Carcha… F <NA> 1042 1140 1410 15 1 25
2 Redfish… Carcha… F <NA> 812 900 1090 15 1 16
3 Redfish… Carcha… M <NA> 792 882 1092 15 2 16
4 Corpus_… Sphyrn… F MAT 760 861 1090 10 2 12
5 Redfish… Sphyrn… F MAT 781 853 1020 10 4 29
6 Corpus_… Sciaen… U <NA> NA 841 950 10 3 25
7 Redfish… Carcha… M <NA> 740 840 1010 15 3 29
8 Redfish… Carcha… M <NA> 740 820 1020 10 4 1
9 Aransas… Carcha… M <NA> 720 812 912 15 4 22
10 Redfish… Sphyrn… M MAT 721 783 980 10 3 11
# ℹ 2,315 more rows
# ℹ 2 more variables: Month <dbl>, Year <dbl>
7.5 Filtering (subsetting) rows
Frequently, we are less interested in being able to sort columns by content, rather, we want to extract a subset of rows based on specific content.
The function filter() is used to subset a data frame by row based on regular expressions and the boolean operators we previously encounter to describe the content of sets of rows.
For example, we might a data.frame with only Gafftop sail catfish (Bagre marinus)5.
5 Remember for exact matches we use == not =
catch %>%
filter(Species == "Bagre_marinus")# A tibble: 1,511 × 12
Site Species Sex Observed_Stage PCL FL STL Hook_Size Set Day
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aransas… Bagre_… U <NA> NA 287 353 10 1 28
2 Aransas… Bagre_… U <NA> NA 425 495 10 1 28
3 Aransas… Bagre_… U <NA> NA 416 502 15 1 28
4 Aransas… Bagre_… U <NA> NA 416 507 10 1 28
5 Aransas… Bagre_… U <NA> NA 418 510 15 1 28
6 Aransas… Bagre_… U <NA> NA 434 515 10 1 28
7 Aransas… Bagre_… U <NA> NA 427 520 15 1 28
8 Aransas… Bagre_… U <NA> NA 446 532 10 1 28
9 Aransas… Bagre_… U <NA> NA 465 538 10 1 28
10 Aransas… Bagre_… U <NA> NA 450 539 10 1 28
# ℹ 1,501 more rows
# ℹ 2 more variables: Month <dbl>, Year <dbl>
If we want all rows but Gafftop sailfish you can use a ! to say “not that” instead of having to list all the species that you do want to keep6.
6 This is frequently called “blacklisting”, while creating a list of content that you do want to keep would be referred to as “whitelisting”.
catch %>%
filter(!Species == "Bagre_marinus")# A tibble: 814 × 12
Site Species Sex Observed_Stage PCL FL STL Hook_Size Set Day
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Redfish… Rhizop… M <NA> 351 378 433 15 1 29
2 Redfish… Sphyrn… F <NA> 470 430 600 10 3 29
3 Redfish… Hypanu… F <NA> NA NA 670 15 3 29
4 Redfish… Hypanu… F <NA> NA NA 340 10 3 29
5 Redfish… Hypanu… M <NA> NA NA 810 10 4 29
6 Corpus_… Carcha… F <NA> 609 670 820 15 1 30
7 Corpus_… Sphyrn… M <NA> 495 485 615 10 2 24
8 Corpus_… Sphyrn… F <NA> 550 370 720 10 2 24
9 Corpus_… Sphyrn… M <NA> 470 505 645 10 3 24
10 Corpus_… Sphyrn… F <NA> 540 565 720 10 3 24
# ℹ 804 more rows
# ℹ 2 more variables: Month <dbl>, Year <dbl>
Sometimes you might want to select rows that match one of a set of values7. In this case we would use %in% to indicate “keep any of these”.
7 Recall, the function c() (concatenate) creates a vector
catch %>%
filter(Species %in% c("Sciades_felis", "Bagre_marinus", "Synodus_foetens"))# A tibble: 2,166 × 12
Site Species Sex Observed_Stage PCL FL STL Hook_Size Set Day
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aransas… Bagre_… U <NA> NA 287 353 10 1 28
2 Aransas… Bagre_… U <NA> NA 425 495 10 1 28
3 Aransas… Bagre_… U <NA> NA 416 502 15 1 28
4 Aransas… Bagre_… U <NA> NA 416 507 10 1 28
5 Aransas… Bagre_… U <NA> NA 418 510 15 1 28
6 Aransas… Bagre_… U <NA> NA 434 515 10 1 28
7 Aransas… Bagre_… U <NA> NA 427 520 15 1 28
8 Aransas… Bagre_… U <NA> NA 446 532 10 1 28
9 Aransas… Bagre_… U <NA> NA 465 538 10 1 28
10 Aransas… Bagre_… U <NA> NA 450 539 10 1 28
# ℹ 2,156 more rows
# ℹ 2 more variables: Month <dbl>, Year <dbl>
Again, if you wanted everything but rows containing those values you would preface it with a !.
catch %>%
filter(!Species %in% c("Sciades_felis", "Bagre_marinus", "Synodus_foetens"))# A tibble: 159 × 12
Site Species Sex Observed_Stage PCL FL STL Hook_Size Set Day
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Redfish… Rhizop… M <NA> 351 378 433 15 1 29
2 Redfish… Sphyrn… F <NA> 470 430 600 10 3 29
3 Redfish… Hypanu… F <NA> NA NA 670 15 3 29
4 Redfish… Hypanu… F <NA> NA NA 340 10 3 29
5 Redfish… Hypanu… M <NA> NA NA 810 10 4 29
6 Corpus_… Carcha… F <NA> 609 670 820 15 1 30
7 Corpus_… Sphyrn… M <NA> 495 485 615 10 2 24
8 Corpus_… Sphyrn… F <NA> 550 370 720 10 2 24
9 Corpus_… Sphyrn… M <NA> 470 505 645 10 3 24
10 Corpus_… Sphyrn… F <NA> 540 565 720 10 3 24
# ℹ 149 more rows
# ℹ 2 more variables: Month <dbl>, Year <dbl>
For numbers you likely aren’t just searching for exact matches, you also want to be able to set threshold values and select everything above or below. For example, you can select all rows with values greater than a certain value using >.
catch %>%
filter(FL > 440)# A tibble: 907 × 12
Site Species Sex Observed_Stage PCL FL STL Hook_Size Set Day
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aransas… Bagre_… U <NA> NA 446 532 10 1 28
2 Aransas… Bagre_… U <NA> NA 465 538 10 1 28
3 Aransas… Bagre_… U <NA> NA 450 539 10 1 28
4 Aransas… Bagre_… U <NA> NA 496 565 10 1 28
5 Aransas… Bagre_… U <NA> NA 476 569 10 1 28
6 Aransas… Bagre_… U <NA> NA 495 570 10 1 28
7 Aransas… Bagre_… U <NA> NA 490 575 10 1 28
8 Aransas… Bagre_… U <NA> NA 486 581 10 1 28
9 Aransas… Bagre_… U <NA> NA 503 589 10 1 28
10 Aransas… Bagre_… U <NA> NA 489 590 10 1 28
# ℹ 897 more rows
# ℹ 2 more variables: Month <dbl>, Year <dbl>
In this second piece of code you used a single boolean operator to include two conditions, “smaller than” and “equal two”. That is a special case of wanting to retain data than fulfills one of either of two conditions and we have a specific boolean operator that can combine the two.
This is not always the case, for example, you might want to retain data that fulfills conditions in two different columns. In this case you can combine expressions using & to indicate that it must fulfill all conditions indicated or | to indicate that it must retain at least one of the.
For example to select only scalloped hammerheads that are also smaller than 300 cm you would use
catch %>%
filter(Species == "Sphyrna_lewini" & FL < 300)# A tibble: 1 × 12
Site Species Sex Observed_Stage PCL FL STL Hook_Size Set Day
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Corpus_C… Sphyrn… F <NA> 192 210 280 15 3 6
# ℹ 2 more variables: Month <dbl>, Year <dbl>
By contrast, if you wanted all entries that are either gafftops or a fork length smaller than 300 cm you could use the following code:
catch %>%
filter(Species == "Sphyrna_lewini" | FL < 300)# A tibble: 409 × 12
Site Species Sex Observed_Stage PCL FL STL Hook_Size Set Day
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aransas… Bagre_… U <NA> NA 287 353 10 1 28
2 Aransas… Bagre_… U <NA> NA 285 314 10 2 28
3 Corpus_… Bagre_… U <NA> NA 299 348 10 1 30
4 Corpus_… Bagre_… U <NA> NA 297 367 10 2 30
5 Corpus_… Bagre_… U <NA> NA 298 362 10 3 30
6 Corpus_… Bagre_… U <NA> NA 290 350 10 2 24
7 Redfish… Bagre_… U <NA> NA 254 284 10 4 8
8 Aransas… Bagre_… U <NA> NA 50 574 10 1 25
9 Aransas… Bagre_… U <NA> NA 280 340 3 3 25
10 Redfish… Bagre_… U <NA> NA 294 353 10 4 16
# ℹ 399 more rows
# ℹ 2 more variables: Month <dbl>, Year <dbl>
7.6 Sneak peak: grouping rows for specific wrangling actions
We have already see that it can be helpful to subset rows based on conditions that are met by the content of more than one column. In those cases, we were creating conditions based on Boolean operators.
In many cases we might be interested in subsetting a dataframe in a way where our conditions cannot be expressed by a TRUE/FALSE scenario using Boolean operators.
For example, we might want to extract the data entry for the longest fish in the data set based on forklength.
The function max() can be used to get the maximum value for a vector of numbers. In this case, the vector we are looking at is the FL column of the catch dataframe.
catch %>%
filter(FL == max(FL, na.rm = TRUE))# A tibble: 1 × 12
Site Species Sex Observed_Stage PCL FL STL Hook_Size Set Day
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aransas_… Carcha… F <NA> 1042 1140 1410 15 1 25
# ℹ 2 more variables: Month <dbl>, Year <dbl>
That’s great, now we now what the largest fish is that we caught.
What about if we wanted to subset the dataframe to retain the largest fish based on forklength for each species?
The tidyverse has a central concept call “split-apply-combine”, which means that occasionally we want to group entries in a dataframe (split), do some sort of manipulation (apply), but end up with a single data frame (combine). We will look at how useful this is in the next chapter but let’s take a quick sneak peak at how this is implemented in dplyr using group_by().
catch %>%
group_by(Species) %>%
filter(FL == max(FL, na.rm = TRUE)) %>%
ungroup()# A tibble: 12 × 12
Site Species Sex Observed_Stage PCL FL STL Hook_Size Set Day
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Corpus_… Bagre_… U <NA> NA 575 640 10 2 24
2 Aransas… Rhizop… F <NA> 580 637 790 10 2 25
3 Aransas… Carcha… F <NA> 1042 1140 1410 15 1 25
4 Redfish… Carcha… F <NA> 812 900 1090 15 1 16
5 Corpus_… Sphyrn… F MAT 760 861 1090 10 2 12
6 Aransas… Carcha… F <NA> 690 757 940 10 2 22
7 Corpus_… Sphyrn… F <NA> 520 578 770 10 1 21
8 Corpus_… Carcha… U <NA> 335 415 475 10 1 3
9 Aransas… Sciade… U <NA> NA 480 548 10 2 18
10 Aransas… Sciade… U <NA> NA 480 580 10 2 18
11 Corpus_… Sciaen… U <NA> NA 841 950 10 3 25
12 Corpus_… Synodu… U <NA> NA 173 185 10 3 30
# ℹ 2 more variables: Month <dbl>, Year <dbl>
This is also an example of how we can use the pipe (%>%) to string a bunch of commands, in this example we are saying “take the object catch, and then group rows by Species and then for each group retain only the maximum forklength value for that group and then ungroup them again.”
Specifically for cases where we want to retain the largest or smallest values, we can use family of of functions called slice() which allow us to subset rows based on their position.
For example, we can retain the largest 5 individuals per species based on forklength using slice_max()
catch %>%
group_by(Species) %>%
slice_max(order_by = FL, n = 5)# A tibble: 66 × 12
# Groups: Species [14]
Site Species Sex Observed_Stage PCL FL STL Hook_Size Set Day
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Corpus_… Bagre_… U <NA> NA 575 640 10 2 24
2 Corpus_… Bagre_… U <NA> NA 574 676 10 2 21
3 Aransas… Bagre_… U <NA> NA 565 NA 10 1 17
4 Redfish… Bagre_… U <NA> NA 564 651 15 1 29
5 Aransas… Bagre_… U <NA> NA 555 541 10 3 13
6 Aransas… Bagre_… U <NA> NA 555 NA 15 1 17
7 Redfish… Carcha… F <NA> 812 900 1090 15 1 16
8 Redfish… Carcha… M <NA> 792 882 1092 15 2 16
9 Redfish… Carcha… M <NA> 740 820 1020 10 4 1
10 Redfish… Carcha… M <NA> 660 722 880 10 1 20
# ℹ 56 more rows
# ℹ 2 more variables: Month <dbl>, Year <dbl>
7.7 Create a subset with only unique entries
Occasionally, you might want to create a subset of the data set that shows only the unique (distinct) entries for a specific column; this is especially common during an exploratory analysis of a data set that you are getting an overview of. This can be achieved using the function distinct().
For example, we might want to know which years the survey took place.
catch %>%
distinct(Year)# A tibble: 4 × 1
Year
<dbl>
1 2015
2 2016
3 2017
4 2018
Notice how that dropped all the other columns. You can switch that off using .keep_all = FALSE.
You can also combine columns. For example if we wanted to determine the individual sets of the data set we could use
catch %>%
distinct(Day, Month, Year, Set)# A tibble: 197 × 4
Day Month Year Set
<dbl> <dbl> <dbl> <dbl>
1 28 7 2015 1
2 28 7 2015 2
3 28 7 2015 3
4 28 7 2015 4
5 29 7 2015 1
6 29 7 2015 2
7 29 7 2015 4
8 30 7 2015 1
9 30 7 2015 2
10 30 7 2015 3
# ℹ 187 more rows